ABSTRACT


Randomized A/B tests in educational software are not run 
in a vacuum: often, reams of historical data are available 
alongside the data from a randomized trial. This paper 
proposes a method to use this historical data (often
high-dimensional and longitudinal) to improve causal estimates
from A/B tests. The method proceeds in three steps: first, fit
a machine learning model to the historical data predicting 
students’ outcomes as a function of their covariates. Then, 
use that model to predict the outcomes of the randomized 
students in the A/B test. Finally, use design-based methods 
to estimate the treatment effect in the A/B test, using pre- 
diction errors in place of outcomes. This method retains all 
of the advantages of design-based inference while, under certain
conditions, yielding more precise estimators. This paper
gives a theoretical condition under which the method
improves statistical precision, and demonstrates it using a
deep learning algorithm to help estimate effects in a set of 
experiments run inside ASSISTments. 


1. INTRODUCTION 

Randomized A/B tests hold a lot of promise for the study 
of student learning within intelligent tutors. Not only do 
they allow for causal inference without fear of confounding, 
but they also allow for “design-based” effect and standard 
error estimates that are virtually guaranteed to be unbiased 
[13]. These strengths are due to the fact that data analysts know
exactly how, and with what probability, conditions were as- 
signed to subjects. 


The traditional tools for analyzing experiments estimate effects
using only data from the experiments themselves, discarding
data from (potential) subjects that were not randomized.
For instance, data from past users or from concurrent
users who, for whatever reason, were not included
in the A/B test, are excluded from the analysis. We refer to 
users such as these, with similar covariate and outcome data 
as the participants in the A/B test, but who were not ran- 
domized, as the “remnant.” Excluding the remnant makes 
good statistical sense: after all, the probabilities of assign- 
ment are known only for participants in the experiment, 
not for the remnant. However, data from the remnant may 
be quite useful—in particular, the extra sample size could 
improve the statistical precision, i.e. reduce the standard 
errors, of experimental effect estimates. This is especially 
the case for experiments run within intelligent tutors or 
other big data environments. Vast amounts of log data, 
collected prior to the experiment, in conjunction with pow- 
erful machine-learning methods, could help sharpen causal 
estimates considerably. 


This paper introduces a method for using the remnant in an- 
alyzing experiments, without sacrificing any of the benefits 
of experimentation or making additional modeling assump- 
tions. The core of the method is residualization—predicting 
experimental subjects’ outcomes using a model fit to the 
remnant, and then estimating effects using prediction resid- 
uals instead of the outcomes themselves. We call the method 
“remnant-based residualization,” or “rebar.” Rebar builds on 
methods suggested in [9], [7], and [1]. Rebar was first intro- 
duced in [12] as a method to reduce confounding bias in 
observational studies. Here we show that rebar can also re- 
duce standard errors in randomized A/B tests, particularly 
in educational data mining contexts. 


Most other methods incorporating machine learning into 
analysis of experiments, either to estimate average effects 
(e.g. [16]), to estimate subgroup effects (e.g. [3]), or to 
optimally allocate treatment (e.g. [11], [19]) use machine 
learning to replace, rather than complement, design-based
methods. This is a very promising avenue of research, but 
lacks the statistical guarantees of well-worn design-based es- 
timates. 


Proceedings of the 11th International Conference on Educational Data Mining 479 


The next section will formally introduce rebar, and sections 
3, 4, and 5 will illustrate it using a deep-learning model to 
sharpen effect estimates in a set of 22 experiments run within 
the ASSISTments system [14]. Section 6 will conclude. 


2. REBAR 
2.1 Experiments and Modeling 


To learn if an intervention worked, or to figure out which of 
two conditions (say, condition 0 and condition 1) produces 
better outcomes, statistical models can often be quite helpful.
To take a common example, analysts might regress an
outcome Y on an indicator for condition Z, along with a
vector of covariates x. Then, the estimated coefficient on
Z is taken as the estimated effect of condition 1 versus 0,
controlling for x [18].


The shortcomings of this approach are well-known: if the 
vector x is missing a confounder—a covariate that predicts 
both subjects’ choice of condition, 0 or 1, and outcomes Y— 
then the regression estimate will be biased. Moreover, even
if there are no unmeasured confounders, if the regression
model is misspecified (for instance, by modeling a nonlinear
relationship between Y and x as linear), then the estimate will also
be biased. On the other hand, a regression model may be
run on all available data, producing precise (if inaccurate)
estimates.


Randomized experiments correct regression’s faults. If sub- 
jects are randomly assigned to conditions 0 or 1, then the 
difference in mean outcomes between the two groups is an 
unbiased estimate of the average treatment effect. More 
precisely, following [15] and [10], let $y_{1i}$ be the outcome a
subject $i$ would experience if assigned to condition 1, and
let $y_{0i}$ be the outcome $i$ would experience under condition
0. A subject’s observed outcome is $Y_i = y_{1i}$ if $i$ is assigned to
1 ($Z_i = 1$), and $Y_i = y_{0i}$ if $i$ is assigned to 0 ($Z_i = 0$). (Since observed
outcomes Y are a function of Z, they are random; we may
model potential outcomes $y_0$ and $y_1$ as fixed.)


Under this framework, we define causal effects based on potential
outcomes $y_0$ and $y_1$, rather than observed outcomes
Y. An individual $i$’s treatment effect $\tau_i$ is the difference of
those two, $\tau_i = y_{1i} - y_{0i}$: the difference between $i$’s outcome
under treatment versus under control. Without strong assumptions,
these individual effects are not identified by the
data; instead, we estimate quantities such as the average
treatment effect (ATE) over all subjects, $\bar{\tau} = \sum_i \tau_i / n$, or
the average effect of the treatment on the treated (TOT),
$\bar{\tau}_{Z=1} = \sum_i Z_i \tau_i / n_1$, where $n$ and $n_1$ are the total number
of subjects and the number of treated subjects, respectively.
In a simple randomized experiment, the ATE and
TOT have the same expectation, but their estimators may have
different standard errors. For the sake of simplicity, we will
focus on the TOT.


Observed outcomes may be used to estimate the ATE, TOT,
or other causal parameters. In particular, an unbiased estimator
of the TOT is:

$$\hat{\tau} = \bar{Y}_{Z=1} - \bar{Y}_{Z=0},$$

where $\bar{Y}_{Z=1}$ is the mean of Y for treated subjects,
$\sum_i Z_i Y_i / n_1$, and $\bar{Y}_{Z=0}$ is the mean of Y in the control group.


An unbiased estimator of the squared standard error is:

$$SE^2_{TOT} = \frac{n}{n_1 n_0}\, s^2(Y)_{Z=0},$$

where $s^2(Y)_{Z=0}$ is the sample variance of Y in the control
group and $n_0 = n - n_1$ is the number of control subjects. See the
Technical Appendix and [4] for more details.
Estimators $\hat{\tau}$ and $SE_{TOT}$, and their properties, are derived
solely from the experimental design, via survey sampling
theory; they do not depend on the (unknown) distributions
of $y_1$ and $y_0$, or any other modeling assumptions. They are
“design-based.”
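As a minimal sketch of these two formulas, the following computes the difference in means and its design-based standard error using only the Python standard library. The function name `tot_estimate` and the eight made-up completion indicators are illustrative assumptions, not part of the original analysis.

```python
import math
from statistics import mean, variance

def tot_estimate(y, z):
    """Difference in means and its design-based standard error.

    y: observed outcomes; z: 0/1 assignment indicators (same length).
    """
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    n1, n0 = len(treated), len(control)
    n = n1 + n0
    tau_hat = mean(treated) - mean(control)
    # SE^2 = n / (n1 * n0) * s^2(Y), with s^2 taken in the control group
    se = math.sqrt(n / (n1 * n0) * variance(control))
    return tau_hat, se

# Made-up completion indicators for eight students (hypothetical data)
y = [1, 0, 1, 1, 0, 0, 1, 0]
z = [1, 1, 1, 1, 0, 0, 0, 0]
tau_hat, se = tot_estimate(y, z)
```

Only the control-group variance enters the standard error, which is why later sections focus on how well control outcomes can be predicted.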


In a randomized experiment there are no confounders. Since 
the probability distribution of Z is known exactly, no sta- 
tistical models, or modeling assumptions, are necessary— 
the analysis may be “design-based” instead of model-based. 
In particular, the estimate $\hat{\tau}$ and its standard error derive
from survey sampling theory, not the distribution of $y_0$ or
$y_1$. On the other hand, any data from the “remnant” of
an experiment—the set of subjects outside the experiment, 
who were not randomized to either condition—play no role 
in this analysis. Since subjects in the remnant were not ran- 
domized, there is no telling how they may differ from the 
Z = 0 or Z = 1 groups, in ways measured or unmeasured,
and there is no telling (exactly, statistically) how their data 
came to be, so design-based analysis is impossible and any 
model fit to the remnant is most likely misspecified. How- 
ever, though dropping the remnant from the analysis brings 
unbiasedness, it also brings a loss of precision—all that sam- 
ple size, thrown away. 


2.2 A Role for the Remnant 


Assume the following setup: a set of users, “the experimen- 
tal set” were randomized to either condition 0 or condition 1, 
and their outcomes Y were measured at the end of the exper- 
iment. Conditions 0 or 1 could be two different treatments, 
or control and treatment condition; we will refer to condition 
0 as “control” and 1 as “treatment.” The goal of the experiment
is to estimate the TOT, $\bar{\tau}_{Z=1}$, the average effect in the
treatment group. Some more subjects, the remnant, were 
not randomized; instead, they all received condition 0, the 
default (this isn’t strictly necessary—the theory also works if 
they received condition 1, a mix of conditions, or something 
else altogether—but it makes things simpler). Outcomes Y 
were also measured for members of the remnant. Finally, 
a set of covariates x, possibly high-dimensional, of mixed- 
types, and/or longitudinal, were measured for everyone, in 
the experimental set and in the remnant. 


Experimental estimates typically drop the remnant, and pay 
the price of lower precision. Instead, we suggest training 
a machine-learning model on the remnant, and using it to 
“residualize” the data from the experimental set—that is, es- 
timate effects using prediction residuals. We call this algo- 
rithm “remnant-based residualization” or “rebar.” The pro- 
cess is as follows: 


1. Using data from the remnant, train a model $\hat{y}_0(\cdot)$ to
predict $y_0$ as a function of x.


2. Validate $\hat{y}_0(\cdot)$ (using cross-validation or other techniques).
If it performs well, proceed; otherwise return
to step 1, choosing a different model.


3. Use $\hat{y}_0(\cdot)$ and covariates x in the experimental set to
generate predicted outcomes $\hat{y}_0(x)$ and residuals, $e = Y - \hat{y}_0(x)$.


4. Estimate the TOT as a difference in mean residuals,
$\hat{\tau}_{rebar} = \bar{e}_{Z=1} - \bar{e}_{Z=0}$,
with estimated standard error
$SE_{rebar} = \sqrt{n/(n_1 n_0)}\; s(e)_{Z=0}$,
where $s(e)_{Z=0}$ is the sample standard deviation of e
in the control group.
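The steps above can be sketched end to end as follows. This is an illustration under stated assumptions: a one-covariate least-squares line (`fit_line`, a hypothetical helper) stands in for the machine-learning model, step 2 is skipped, and all numbers are made up.

```python
import math
from statistics import mean, variance

def fit_line(x, y):
    """Stand-in for the prediction model: one-covariate least squares."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return lambda v: intercept + slope * v

def diff_in_means(values, z):
    """Difference in means with the design-based standard error."""
    treated = [v for v, zi in zip(values, z) if zi == 1]
    control = [v for v, zi in zip(values, z) if zi == 0]
    n1, n0 = len(treated), len(control)
    est = mean(treated) - mean(control)
    se = math.sqrt((n1 + n0) / (n1 * n0) * variance(control))
    return est, se

# Step 1: train on the remnant only (hypothetical data).
remnant_x = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
remnant_y = [0.2, 0.3, 0.5, 0.6, 0.7, 0.9]
predict_y0 = fit_line(remnant_x, remnant_y)

# Step 2 (validation on held-out remnant data) is omitted in this sketch.

# Step 3: predict for the experimental set and form residuals.
x = [0.2, 0.5, 0.7, 0.9, 0.1, 0.4, 0.6, 0.8]
y = [0.4, 0.7, 0.8, 1.0, 0.1, 0.4, 0.7, 0.8]
z = [1, 1, 1, 1, 0, 0, 0, 0]
residuals = [yi - predict_y0(xi) for xi, yi in zip(x, y)]

# Step 4: difference in mean residuals and its standard error.
tau_usual, se_usual = diff_in_means(y, z)
tau_rebar, se_rebar = diff_in_means(residuals, z)
```

The experimental outcomes are never used to fit `predict_y0`; they enter only in steps 3 and 4.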


Just like the traditional estimator $\hat{\tau}$, the rebar estimator
$\hat{\tau}_{rebar}$ is design-based: its logical basis is the designed
experiment, not a model. On the other hand, it harvests information
from the remnant to improve upon $\hat{\tau}$.


Rebar works because the predictions $\hat{y}_0(x)$ were generated
from an external sample (the remnant) and pre-treatment
covariates x. Subject $i$’s prediction $\hat{y}_0(x_i)$ will be the same
whether $i$ is assigned to 0 or to 1. Since there is no treatment
effect on $\hat{y}_0(x)$, subtracting $\hat{y}_0(x)$ from Y only removes
noise, not part of the treatment effect. When treatment
is randomized, Z is independent of $\hat{y}_0(x)$, so, in expectation,
the mean of $\hat{y}_0(x)$ will be equal across the two treatment
groups. In fact, the rebar estimator can be re-written
as $\hat{\tau}_{rebar} = \bar{Y}_{Z=1} - \bar{Y}_{Z=0} - (\overline{\hat{y}_0(x)}_{Z=1} - \overline{\hat{y}_0(x)}_{Z=0})$. The
first term, $\bar{Y}_{Z=1} - \bar{Y}_{Z=0}$, is $\hat{\tau}$, which is unbiased for the TOT. The second
term is the difference in means of $\hat{y}_0(x)$, which is zero
in expectation; therefore, $\hat{\tau}_{rebar}$ is unbiased. This property
holds not just for the difference-in-means estimator: rebar
can sharpen any treatment effect estimator that is linear in
Y and unbiased.
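The unbiasedness argument can be checked numerically on a toy example by enumerating every possible assignment. The sketch below assumes six subjects with made-up potential outcomes and deliberately poor predictions. In a real experiment only one potential outcome per subject is observed, so this is a thought experiment rather than an analysis procedure.

```python
from itertools import combinations
from statistics import mean

# Hypothetical potential outcomes for six subjects.
y0 = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7]
y1 = [0.6, 0.5, 0.3, 1.0, 0.9, 0.7]
# Deliberately poor predictions: unbiasedness should not depend on their quality.
y0_hat = [0.9, 0.1, 0.8, 0.0, 0.5, 0.3]

n, n1 = len(y0), 3
errors_usual, errors_rebar = [], []
for treated in combinations(range(n), n1):       # every possible assignment
    control = [i for i in range(n) if i not in treated]
    tot = mean(y1[i] - y0[i] for i in treated)   # estimand for this assignment
    usual = mean(y1[i] for i in treated) - mean(y0[i] for i in control)
    rebar = (mean(y1[i] - y0_hat[i] for i in treated)
             - mean(y0[i] - y0_hat[i] for i in control))
    errors_usual.append(usual - tot)
    errors_rebar.append(rebar - tot)

# Averaged over all equally likely assignments, both errors vanish
# (up to floating-point rounding).
bias_usual = mean(errors_usual)
bias_rebar = mean(errors_rebar)
```

Replacing `y0_hat` with any other fixed list leaves `bias_rebar` at zero, because the predictions do not depend on the assignment.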


Rebar’s main tool is the model $\hat{y}_0(\cdot)$, which predicts $y_0$ as
a function of x. In EDM settings, the dimension of available
covariates is often very large, and sample sizes are often
large as well, so machine learning algorithms make strong
candidates for $\hat{y}_0(\cdot)$. Here $\hat{y}_0(\cdot)$ is not a statistical model per se,
estimating the parameters of a probability distribution, but
a tool for prediction. It need not be correct in any sense,
and its estimates need not be unbiased or consistent. Since
$\hat{y}_0(\cdot)$ is fit on a separate sample from the experimental subjects,
the process of fitting it (steps 1 and 2 above) does not
affect standard errors, and model misspecification does not
lead to bias.


On the other hand, for rebar to be more precise than the
usual difference in means, $\hat{y}_0(\cdot)$ must be able to generate
decent predictions of $y_0$ in the experimental set: by residualizing,
we subtract out the component of Y’s variance that
is predicted by $\hat{y}_0(\cdot)$. The variance of the rebar estimator
is proportional to the difference between the mean-squared
prediction error of $\hat{y}_0(\cdot)$, $\mathrm{MSE} = \|y_0 - \hat{y}_0(x)\|^2/n$, and its
squared bias. (Recall that both $\hat{\tau}$ and $\hat{\tau}_{rebar}$ are unbiased
estimates of the TOT; the bias here refers to $\hat{y}_0(\cdot)$’s predictions
of $y_0$, not to treatment effect estimates.) The extent
to which it outperforms the standard estimate $\hat{\tau}$, measured
as the percent improvement $(SE^2_{TOT} - SE^2_{rebar})/SE^2_{TOT}$, is at
least as large as $\hat{y}_0(\cdot)$’s prediction $R^2$ in the control group,
$R^2 = 1 - \|Y - \hat{y}_0(x)\|^2_{Z=0} / \|Y - \bar{Y}\|^2_{Z=0}$ (see the Technical
Appendix for derivations). If $\hat{y}_0(\cdot)$ performs poorly in the
control group, so that $\|Y - \hat{y}_0(x)\|_{Z=0} > \|Y - \bar{Y}\|_{Z=0}$, then this
$R^2$, as we have defined it, is negative, and $\hat{\tau}_{rebar}$ may
be less precise than $\hat{\tau}$; however, it will still be unbiased. The
improvement $\hat{\tau}_{rebar}$ offers rests entirely on the performance
of $\hat{y}_0(\cdot)$. The better we can predict how subjects would have
performed in the control condition, the more precisely we
can estimate treatment effects.
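To make the relationship between prediction $R^2$ and precision concrete, the following sketch computes both quantities for a made-up control group, once with good predictions and once with poor ones. The function names and data are illustrative assumptions.

```python
from statistics import mean, variance

def control_r2(y, y_hat):
    """Prediction R^2 as defined in the text (control-group values only)."""
    y_bar = mean(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - y_bar) ** 2 for a in y)
    return 1 - sse / sst

def se2_improvement(y, y_hat):
    """(SE^2_TOT - SE^2_rebar) / SE^2_TOT; the n/(n1*n0) factor cancels."""
    residuals = [a - b for a, b in zip(y, y_hat)]
    return 1 - variance(residuals) / variance(y)

# Made-up control-group outcomes with good and poor predictions
y_control = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
good = [0.2, 0.8, 0.9, 0.1, 0.7, 0.9]
poor = [0.9, 0.2, 0.1, 0.8, 0.3, 0.2]

r2_good, gain_good = control_r2(y_control, good), se2_improvement(y_control, good)
r2_poor, gain_poor = control_r2(y_control, poor), se2_improvement(y_control, poor)
```

In both cases the improvement is at least as large as the $R^2$. With the poor predictions both quantities are negative, so residualizing would inflate the standard error.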


Since $\hat{y}_0(\cdot)$ is trained in the remnant, its performance in the
experimental set (as measured by, e.g., prediction $R^2$) will
be hard to gauge at the outset. If the distribution of Y, conditional
on x, differs widely between the two sets,
$\hat{y}_0(\cdot)$’s performance may suffer in extrapolation. This problem
is not fatal: the rebar estimate is unbiased regardless
of $\hat{y}_0(\cdot)$’s properties. However, a model with poor predictive
power in the experimental set will not reduce standard
errors substantially, and may increase them. Of course an
analyst may calculate $\hat{y}_0(\cdot)$’s $R^2$ in the experimental set, but
choosing a model based on Y induces dependence between
Y and $\hat{y}_0(x)$, and may cause bias. Models trained on a subset
of the remnant that resembles the experimental set, or
which weight such a subset more heavily, may perform
better than those trained on the entire remnant.


The previous discussion assumed simple randomization. 
However, rebar easily extends to more complex designs, in- 
cluding experiments with more than two treatment condi- 
tions. Further, as we will illustrate below, rebar can be ex- 
tended to regression estimators of causal effects as well, mod- 
eling low-dimensional covariates within sample and high- 
dimensional covariates out of sample. 


3. DATA: 22 EXPERIMENTS AND MORE 


The 22 experiment dataset is a feature-rich dataset on
students who participated in randomized controlled trials
(RCTs) run inside a free online tutoring system called ASSISTments
[14]. This dataset consists of student-level data from
8,205 unique students participating in 22 A/B tests, 14,947 
unique student-RCT pairs in total. 


These RCTs were run within skill builders. Inside ASSIST- 
ments, a skill builder is a type of problem set that requires 
students to practice solving problems until they master the 
associated skill. Skill mastery is determined by the student’s 
ability to answer a certain number of problems correctly, 
usually three, in a row. 


This feature-rich dataset includes 30 features, including 
both categorical features, such as student grade levels, and 
numerical features, such as student performances prior to 
the experiment. This dataset also includes two dependent 
measures. The first dependent measure, “completion,” is 
whether the student completed the assignment and achieved 
mastery. The other dependent measure is the number of 
problems attempted; for students who achieved mastery, this 
may be interpreted as mastery speed. The analysis in this 
paper will focus on the first dependent measure, completion. 


4. DEEP LEARNING TO PREDICT COMPLETION


As described in Section 2.2, the model $\hat{y}_0(\cdot)$ is an integral
part of the rebar methodology; its purpose is to produce a
predicted outcome $\hat{y}_0(x)$ as a function of the covariates x. The methodology
does not rely on a specific type of model or algorithm,
so long as the model generates an estimate of the outcome
variable of interest from the included covariates. As also
stated in that section, the accuracy of
the model, however poor, does not lead to bias. That said,
models that are more accurate at estimating the outcome
variable of interest will likely lead to more precise estimates of
treatment effects. Deep learning models have been previously
applied in educational contexts with promising results,
often reporting higher performance than existing methods
[8][6][2]. While such methods are not appropriate for
all problems, due to their size and
complexity, their use in this work is justified
by 1) the scale of data available for model training and
2) the inconsequence of producing an uninterpretable model
(e.g., the significance and coefficients of individual variables
in the model are not intended for study or knowledge discovery).
What is needed, again, is simply a prediction model.


We develop and apply a type of deep learning model known
as a long short-term memory (LSTM) network [5]. This
model is a variant of a recurrent neural network (RNN) [17]
that is commonly applied to time series data to model temporal
relationships within sequences. The model produces
its estimate for each time step using both the covariates
for the current time step and
information from all previous time steps within the
series. As such, the model is developed as a sequence-to-sequence
method that takes a sequence of student data
as input and produces a sequence of outcome estimates of
equal length. The model uses a 3-layer design,
with an input layer feeding into a recurrent hidden layer
(a layer that is also connected to itself at previous
time steps), which then feeds an output layer.
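For readers unfamiliar with the architecture, the forward pass of such a network can be sketched in plain Python. This is a generic, untrained LSTM cell, not the model used in this paper: the hidden size of 8, the random initial weights, the sigmoid output layer, and the class name `TinyLSTM` are all assumptions made for illustration. The 4 inputs and 2 outputs per time step match the description later in this section.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Matrix-vector product plus bias, on plain lists."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

class TinyLSTM:
    def __init__(self, n_in=4, n_hidden=8, n_out=2, seed=0):
        rng = random.Random(seed)
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        # One weight matrix per gate, applied to [x_t, h_{t-1}].
        self.W = {g: mat(n_hidden, n_in + n_hidden) for g in "ifog"}
        self.b = {g: [0.0] * n_hidden for g in "ifog"}
        self.W_out, self.b_out = mat(n_out, n_hidden), [0.0] * n_out
        self.n_hidden = n_hidden

    def forward(self, sequence):
        h = [0.0] * self.n_hidden   # hidden state
        c = [0.0] * self.n_hidden   # cell state
        outputs = []
        for x_t in sequence:
            v = list(x_t) + h
            i = [sigmoid(a) for a in affine(self.W["i"], self.b["i"], v)]    # input gate
            f = [sigmoid(a) for a in affine(self.W["f"], self.b["f"], v)]    # forget gate
            o = [sigmoid(a) for a in affine(self.W["o"], self.b["o"], v)]    # output gate
            g = [math.tanh(a) for a in affine(self.W["g"], self.b["g"], v)]  # candidate
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            outputs.append([sigmoid(a) for a in affine(self.W_out, self.b_out, h)])
        return outputs

model = TinyLSTM()
# Three time steps of four made-up covariates each
history = [[1.0, 5.0, 5.0, 0.2], [0.0, 3.0, 1.0, 0.0], [1.0, 4.0, 4.0, 0.25]]
predictions = model.forward(history)
```

The output sequence has the same length as the input, and the estimate at each step depends only on the current and earlier steps, which is the sequence-to-sequence property described above.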


Two separate datasets are used to train and apply the model.
The application dataset comprises student data from the
22 experiment dataset combined with assignment-level information
for all work each student started before beginning
the respective experiment. In an attempt to reduce the
complexity of the data from which the model must learn,
the sequence length of student assignment history is limited
such that no more than 10 prior assignments are included
for each student. In other words, students who were in a single
experiment have a sequence length of 10, with the last
time step representing the most recent assignment prior to
beginning the experiment. Conversely, students in multiple
experiments may exhibit sequences longer than 10 if participation
in the experiments was separated by fewer than 10
assignments. The dataset comprises data from 8,297
distinct students and a total of 130,935 student assignments.
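One possible reading of this windowing rule is sketched below. The helper names and the representation of an experiment by "number of assignments started before it" are assumptions for illustration; the original preprocessing code is not described at this level of detail.

```python
def prior_window(n_prior, max_len=10):
    """Positions of the last `max_len` assignments among `n_prior` earlier ones."""
    return range(max(0, n_prior - max_len), n_prior)

def student_sequence(assignments, experiment_starts, max_len=10):
    """Assemble one student's input sequence.

    assignments: the student's assignment records in time order.
    experiment_starts: for each experiment, how many assignments the
        student had started before beginning it.
    """
    keep = set()
    for n_prior in experiment_starts:
        keep.update(prior_window(n_prior, max_len))
    return [assignments[pos] for pos in sorted(keep)]

records = [f"assignment_{k}" for k in range(25)]
single = student_sequence(records, [25])      # one experiment
double = student_sequence(records, [12, 16])  # two experiments, 4 assignments apart
```

Under this reading, a student in one experiment contributes the 10 most recent prior assignments, while overlapping windows for two nearby experiments merge into a single longer sequence (14 assignments in the second example).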


The second dataset, used to train and validate the model,
comprises data from students who do not appear in the
22 experiment dataset. These data are collected
from the non-experimental problem sets found in the application
dataset. From these, assignments are randomly
sampled, and the dataset is constructed using the 10
most recent assignments before students begin the sampled
assignment. This step helps to ensure that the structure
of this dataset is similar to that of the application set. Again, for
purposes of validity, it is important to stress that no students
are found in both the training and application datasets. The
training dataset contains data from 134,141 distinct students and a
total of 686,590 student assignments.


The model uses just 4 assignment-level covariates per time
step to predict assignment-level performance on the subsequent
assignment. These covariates are
completion of the assignment, the number of problems
attempted, the number of problems completed, and
a measure of inverse mastery speed; this last measure is
a transformation of mastery speed, equal to 1 divided by the
number of problems when the assignment was completed, or
0 when the assignment was not completed. While simple in
the number of covariates, the model also uses information
from previous time steps, adding to its complexity
(i.e., time step 2 is informed by time step 1, time step 3
is informed by time steps 2 and 1, etc.). The model produces
two values per time step: an estimate of the desired
outcome variable, completion of the next assignment,
and an estimate of inverse mastery speed on the next
assignment. A combined cost of these two measures is used to
update model parameters during training. The second measure
was included because it is believed the model may better
learn from the data by observing a continuous variable in
addition to the binary value of completion; it also serves as
an example of how future work may use the same
methodology to study outcomes beyond the measure of completion
presented in this work.
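The four per-time-step inputs can be assembled as in the sketch below. The function names and the ordering of the covariates are assumptions, as is the choice to use the number of problems attempted as the denominator of inverse mastery speed; the text above says only "the number of problems."

```python
def inverse_mastery_speed(completed, n_problems):
    """1 / (number of problems) for completed assignments, else 0."""
    return 1.0 / n_problems if completed and n_problems > 0 else 0.0

def assignment_covariates(completed, n_attempted, n_completed):
    """The four per-time-step inputs, in a fixed (arbitrary) order."""
    return [
        1.0 if completed else 0.0,
        float(n_attempted),
        float(n_completed),
        inverse_mastery_speed(completed, n_attempted),
    ]

fast = assignment_covariates(True, 3, 3)        # finished after three problems
slow = assignment_covariates(True, 10, 8)       # finished after ten problems
unfinished = assignment_covariates(False, 6, 2)
```

The transformation keeps the measure bounded and defined for every assignment: faster mastery gives a larger value, and an unfinished assignment gives 0 rather than a missing value.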


The model is first evaluated using 5-fold student-level
cross-validation. The model is trained for multiple epochs,
or training cycles through the data, using a 30% holdout
set, sampled from the training set of each fold, to determine
the stopping point of model training; stopping on this holdout set
helps end training before the model overfits. The model
produces an average AUC
of 0.81 and an RMSE of 0.34 for next assignment completion
over the 5 folds. Once cross-validation is complete, a final model is trained
over the entirety of the training dataset and applied to the
application dataset, which has acted as a holdout set during
the training and validation process. The next assignment
completion estimates, collected from the most recent assignment
before students begin each experiment, are then used as
the predicted values of completion in subsequent
steps of the rebar analyses.
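A student-level split of this kind can be sketched as follows. The function name, the shuffling seed, and the way the 30% holdout is drawn are assumptions; the point is only that folds are formed over students, so no student contributes data to more than one of the training, holdout, and test sets within a fold.

```python
import random

def student_level_folds(student_ids, k=5, holdout_frac=0.3, seed=0):
    """Yield (train, holdout, test) lists of student ids for each fold."""
    students = sorted(set(student_ids))
    random.Random(seed).shuffle(students)
    folds = [students[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        n_holdout = round(holdout_frac * len(rest))
        # The holdout set is carved out of the training students and is
        # used only to decide when to stop training.
        yield rest[n_holdout:], rest[:n_holdout], test

ids = [f"student_{i}" for i in range(50)]
splits = list(student_level_folds(ids))
```

Each student appears in exactly one test fold, and all of a student's assignments would follow that student into whichever set they land in.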


5. RESULTS 


We estimated treatment effects of interventions on skill- 
builder completion for the 22 experiments using both raw 
outcomes Y, the usual approach, and using $e = Y - \hat{y}_0(x)$,
the rebar estimator. We also estimated standard errors in 
both cases. We used difference in means estimators, so the 
effect estimates are in units of percentage points—how much 
more likely were students to finish skill builders under the 
treatment condition than under control. 


Figure 1 shows the improvement in precision of the rebar estimator
over the usual estimator: the difference of the
two standard errors, divided by the usual standard error,
$(SE_{usual} - SE_{rebar})/SE_{usual}$. The x-axis shows the prediction
$R^2$ of the deep learning model when extrapolated to
each of the 22 experimental datasets. In 19 of the datasets,
the rebar standard errors were lower than the usual standard
errors. In 15 of those datasets, there was a greater
than 25% improvement, and in four datasets the improvement
was greater than 40%. The extent of the reduction in
standard error corresponded closely to the prediction $R^2$ of
$\hat{y}_0(\cdot)$, with the most dramatic improvements occurring when
$R^2 > 0.5$.


Figure 1: The improvement in precision of effect estimates
as a percentage of the usual precision estimate,
$[SE(\hat{\tau}) - SE(\hat{\tau}_{rebar})]/SE(\hat{\tau})$, plotted as a function
of $\hat{y}_0(\cdot)$’s prediction $R^2$ in the experimental set.


Figure 2 shows the estimated treatment effects and approx- 
imate 95% confidence intervals (two standard errors in each 
direction) for the two sets of estimators. In all but three 
cases, the rebar estimate was slightly closer to zero than the 
usual estimate. This is what we would expect if most of 
the true effects were null, so that reducing the noise of the 
treatment effect estimates would draw them closer to their 
true values. For that reason, although rebar reduced the 
standard errors in almost all of the experiments, it did not 
cause any of the non-significant results to become statisti- 
cally significant. In fact, in two cases it had the opposite 
effect; though this may be disappointing for researchers, it 
is probably more accurate. 


We also used linear regression to estimate treatment effects,
regressing either indicators for completion or prediction errors
on indicators for treatment assignment and two covariates:
the proportion of each student’s prior skill builders
completed and the proportion of prior skill builder problems
the student worked that they answered correctly. The results,
available upon request, are nearly identical. Although the
two covariates improved precision slightly, rebar continued
to dominate the usual estimate.


Figure 2: Effect estimates and 95% confidence intervals
for the 22 experiments, using both the usual
and rebar estimates. Experiments are ordered by
their estimated effect.


6. DISCUSSION 


The rich, high-dimensional, fine-grained data that educa- 
tional technology makes available should be a boon to causal 
inference. However, big data is subject to the same maladies 
as small data—confounding from unmeasured variables, and 
model misspecification. Classical randomized experiments 
remain relevant. 


The same may not be true for classical design-based estimators.
Big data may not be able to correct unmeasured confounding
and may exacerbate model misspecification, but we
have shown that it can play a significant role in reducing the
standard errors of treatment effect estimates. The method
we proposed here retains all of the statistical properties that
recommend design-based estimators, while, in most cases,
delivering substantially lower standard errors. We demonstrated
the method’s effectiveness using a cutting-edge deep
learning algorithm trained on log data from ASSISTments,
which yielded impressive gains in precision when used to
analyze a set of 22 experiments.


Rebar’s most important tool in this exercise was the deep 
learning algorithm, which in 17 of the 22 experiments pre- 
dicted completion better than the within-sample proportion. 
In general, designing prediction algorithms that perform well 
in the target dataset is the central challenge to effectively 
implementing rebar. Along the same lines, the most important
open question is how to design diagnostics for prediction
performance that do not rely on “peeking” at the experimental
outcomes. One such diagnostic, termed “proximal
validation,” was described in [12]; extending it to experimental
studies and showing that it works is the next step in
developing this method.


Wedding classical randomization-based causal inference 
with modern machine learning and big data can yield 
unbiased, robust, precise treatment effect estimates in 
technology-based educational datasets. 


Technical Appendix 

This discussion roughly follows [4], Section 1.1. The
TOT $\sum_i Z_i (y_{1i} - y_{0i})/n_1$ may be re-written as
$n_1^{-1}\left(\sum_i Y_i - \sum_i y_{0i}\right)$, that is, $1/n_1$ times the difference between the total
of Y across both treatment groups and the total that
would have been observed had everyone received the control
condition. The first sum is known exactly, but the second
must be estimated using data from the control group.
From elementary survey sampling, $n\bar{Y}_{Z=0}$ is unbiased
for $\sum_i y_{0i}$. Further, the standard deviation of $n\bar{Y}_{Z=0}$
is $\sqrt{n^2 (1 - n_0/n)\, s^2(y_0)/n_0} = \sqrt{n\, n_1\, s^2(y_0)/n_0}$, where
$s^2(y_0)$ is the sample variance, over the whole sample, of $y_0$,
and $1 - n_0/n = n_1/n$ is the finite population correction.
Finally, due to random sampling, $s^2(Y)_{Z=0}$ is an unbiased
estimator for $s^2(y_0)$. Substituting $s^2(Y)_{Z=0}$ for $s^2(y_0)$ and dividing
by $n_1$ gives the expression for $SE_{TOT}$, $\sqrt{n/(n_1 n_0)\, s^2(Y)_{Z=0}}$.


Each individual treatment effect $\tau_i$ is the same whether the
outcome (dependent variable) is Y or e:

$$e_1 - e_0 = (y_1 - \hat{y}_0(x)) - (y_0 - \hat{y}_0(x)) = y_1 - y_0,$$

since $\hat{y}_0(x)$ is invariant to treatment. Therefore, the theory
supporting standard estimates $\hat{\tau}$ and $SE_{TOT}$ applies equally
to $\hat{\tau}_{rebar}$ and $SE_{rebar}$. In particular, $\hat{\tau}_{rebar}$ is unbiased for
the TOT with consistent standard error estimate $SE_{rebar}$,
due to survey sampling theory.


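The sample variance of e in the control group is

$$
\begin{aligned}
s^2(e)_{Z=0} &= \frac{\|e - \bar{e}\|^2_{Z=0}}{n_0 - 1} \\
&= \frac{\|Y - \hat{y}_0(x) - (\bar{Y} - \overline{\hat{y}_0(x)})\|^2_{Z=0}}{n_0 - 1} \\
&= \frac{\|Y - \hat{y}_0(x)\|^2_{Z=0}}{n_0 - 1} - \frac{n_0}{n_0 - 1}\left(\bar{Y}_{Z=0} - \overline{\hat{y}_0(x)}_{Z=0}\right)^2,
\end{aligned}
$$

or the MSE of $\hat{y}_0(\cdot)$ in the control group, minus its squared
bias. Since the squared bias is never negative,

$$s^2(e)_{Z=0} \le \frac{\|Y - \hat{y}_0(x)\|^2_{Z=0}}{n_0 - 1}.$$

Therefore, the ratio of the estimated squared rebar standard error to
the usual squared TOT standard error is:

$$\frac{SE^2_{rebar}}{SE^2_{TOT}} = \frac{s^2(e)_{Z=0}}{s^2(Y)_{Z=0}} \le \frac{\|Y - \hat{y}_0(x)\|^2_{Z=0}}{\|Y - \bar{Y}\|^2_{Z=0}} = 1 - R^2_{Z=0},$$

with equality if $\hat{y}_0(\cdot)$ is unbiased.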